@jacobthebanana
Contributor

@jacobthebanana jacobthebanana commented Mar 7, 2024

Ensures the LoRA ID is a part of the hash used for prefix blocks.
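
As a minimal sketch of the idea (not the code in this PR), a prefix-block hash can fold the adapter ID into the hashed key. The function name `block_hash` and its parameters are hypothetical and chosen only for illustration; `lora_int_id=None` is assumed here to mean "no adapter".

```python
from typing import Optional, Sequence


def block_hash(token_ids: Sequence[int], logical_idx: int, block_size: int,
               lora_int_id: Optional[int]) -> int:
    """Hypothetical helper: hash the prefix that ends at block `logical_idx`."""
    num_tokens = (logical_idx + 1) * block_size
    prefix = tuple(token_ids[:num_tokens])
    # Including the adapter ID in the key keeps blocks computed under
    # different adapters (or under no adapter) from sharing a cache entry.
    return hash((prefix, lora_int_id))


tokens = list(range(32))
same_adapter = block_hash(tokens, 0, 16, 1) == block_hash(tokens, 0, 16, 1)
different_adapter = block_hash(tokens, 0, 16, None) != block_hash(tokens, 0, 16, 1)
```

With the adapter ID in the key, identical prompts still share blocks when they use the same adapter, while requests that differ only in adapter get distinct hashes.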

@jacobthebanana
Contributor Author

Example unit test output with the revised test case and without the fix (see commit 3441735).

  • test_auto_prefix_caching passes when the requests specify a single LoRA adapter, or when no adapter is requested.
  • test_auto_prefix_caching fails when subsequent requests specify different adapters (or when one request uses no adapter and another enables a LoRA adapter).
$ git reset --hard 3441735
> HEAD is now at 3441735 Added test case of lora block_hash conflict.
$ pytest tests/test_cache_block_hashing.py
============================================================= test session starts ==============================================================
platform linux -- Python 3.10.12, pytest-8.0.2, pluggy-1.4.0
plugins: forked-1.6.0, anyio-4.3.0, rerunfailures-13.0, asyncio-0.23.5
asyncio: mode=strict
collected 5 items                                                                                                                              

tests/test_cache_block_hashing.py ..FFF                                                                                                  [100%]

=================================================================== FAILURES ===================================================================
_________________________________ test_auto_prefix_caching[concurrent_lora_int_ids2-256-16-facebook/opt-125m] __________________________________

model = 'facebook/opt-125m', block_size = 16, max_num_seqs = 256, concurrent_lora_int_ids = [None, 1]

...

        for hash0, hash1 in zip(flatten_2d(hashes[0]), flatten_2d(hashes[1])):
>           assert (hash0 != hash1)
E           assert 6230683134333785342 != 6230683134333785342

tests/test_cache_block_hashing.py:84: AssertionError
_________________________________ test_auto_prefix_caching[concurrent_lora_int_ids3-256-16-facebook/opt-125m] __________________________________

model = 'facebook/opt-125m', block_size = 16, max_num_seqs = 256, concurrent_lora_int_ids = [None, 1, 2]
...

tests/test_cache_block_hashing.py:84: AssertionError
_________________________________ test_auto_prefix_caching[concurrent_lora_int_ids4-256-16-facebook/opt-125m] __________________________________

model = 'facebook/opt-125m', block_size = 16, max_num_seqs = 256, concurrent_lora_int_ids = [1, 2]
...

tests/test_cache_block_hashing.py:84: AssertionError
=========================================================== short test summary info ============================================================
FAILED tests/test_cache_block_hashing.py::test_auto_prefix_caching[concurrent_lora_int_ids2-256-16-facebook/opt-125m] - assert 6230683134333785342 != 6230683134333785342
FAILED tests/test_cache_block_hashing.py::test_auto_prefix_caching[concurrent_lora_int_ids3-256-16-facebook/opt-125m] - assert 6230683134333785342 != 6230683134333785342
FAILED tests/test_cache_block_hashing.py::test_auto_prefix_caching[concurrent_lora_int_ids4-256-16-facebook/opt-125m] - assert 6230683134333785342 != 6230683134333785342
==================================================== 3 failed, 2 passed, 1 warning in 1.47s ====================================================
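
The pass/fail pattern in this output can be mimicked in isolation. The following is a toy sketch, not the actual test in `tests/test_cache_block_hashing.py`: `prompt_block_hashes` and `adapters_collide` are hypothetical stand-ins, and only `flatten_2d` borrows its name from the traceback above. The `include_lora` switch is an assumption added to contrast hashing with and without the adapter ID.

```python
from itertools import combinations
from typing import List, Optional, Sequence


def flatten_2d(rows: List[List[int]]) -> List[int]:
    """Flatten a list of lists into one flat list."""
    return [item for row in rows for item in row]


def prompt_block_hashes(prompts: Sequence[Sequence[int]], block_size: int,
                        lora_int_id: Optional[int],
                        include_lora: bool) -> List[List[int]]:
    """Toy per-block prefix hashing (hypothetical, not vLLM's implementation)."""
    hashes = []
    for token_ids in prompts:
        per_prompt = []
        for idx in range(len(token_ids) // block_size):
            prefix = tuple(token_ids[:(idx + 1) * block_size])
            # Without the adapter ID, the key depends on the tokens alone.
            key = (prefix, lora_int_id) if include_lora else prefix
            per_prompt.append(hash(key))
        hashes.append(per_prompt)
    return hashes


def adapters_collide(lora_int_ids: Sequence[Optional[int]],
                     include_lora: bool, block_size: int = 16) -> bool:
    """Return True if any two adapters produce the same hash for a block."""
    prompts = [list(range(48)), list(range(100, 132))]
    per_adapter = [
        flatten_2d(prompt_block_hashes(prompts, block_size, lid, include_lora))
        for lid in lora_int_ids
    ]
    return any(h0 == h1
               for a, b in combinations(per_adapter, 2)
               for h0, h1 in zip(a, b))


# Tokens-only hashing collides across adapters; adding the ID avoids it.
collides_without_id = adapters_collide([None, 1], include_lora=False)
collides_with_id = adapters_collide([None, 1], include_lora=True)
```

A single-adapter run has no pair to compare, which is consistent with those cases passing even without the fix.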

@jacobthebanana jacobthebanana marked this pull request as ready for review March 7, 2024 22:02
@jacobthebanana
Contributor Author

This PR closes #3264

Collaborator

@Yard1 Yard1 left a comment

Thanks, that's exactly how it should be implemented!

@Yard1 Yard1 enabled auto-merge (squash) March 7, 2024 22:06
@Yard1 Yard1 merged commit 8cbba46 into vllm-project:main Mar 7, 2024
AdrianAbeyta pushed a commit to AdrianAbeyta/vllm that referenced this pull request Mar 8, 2024
dtransposed pushed a commit to afeldman-nm/vllm that referenced this pull request Mar 26, 2024
@JJEccles

Hi guys, I'm looking for a solution to this issue, but for OpenAI server calls where I request the LoRA adapter in my POST request. This is the command I use to start my server:

vllm serve unsloth/Llama-3.2-3B \
  --tokenizer unsloth/Llama-3.2-3B \
  --port 8000 \
  --max-model-len 2048 \
  --enable-lora \
  --lora-modules profile_adapter=adapters_tokenizer_profile ingredientslist_adapter=adapters_tokenizer_list_ing \
  --max-lora-rank 64

I was wondering whether it's possible to either adjust this server command or change something in the inference request on the client side, so that caching stops affecting the responses when switching directly from one adapter to another between inference calls. I'm hoping there is something I can add to the server command that solves this. If anyone could point me in the right direction, it would be much appreciated!
